Overview of tidyverse joins by Wickham and Grolemund (2017):
| dplyr (tidyverse) | base::merge |
|---|---|
| inner_join(x, y) | merge(x, y) |
| left_join(x, y) | merge(x, y, all.x = TRUE) |
| right_join(x, y) | merge(x, y, all.y = TRUE) |
| full_join(x, y) | merge(x, y, all = TRUE) |
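The correspondence between the dplyr joins and `merge()` can be checked on a small example. The sketch below assumes two hypothetical toy tables `x` and `y` (not from the slides) that share a key column `id`. It compares row counts only, because `merge()` sorts by the key and may order rows differently from dplyr.

```r
library(dplyr)

# Hypothetical toy tables (illustration only): ids 2 and 3 appear in both
x <- data.frame(id = c(1, 2, 3), a = c("a1", "a2", "a3"))
y <- data.frame(id = c(2, 3, 4), b = c("b2", "b3", "b4"))

# Row counts of each dplyr join ...
rows_dplyr <- c(
  inner = nrow(inner_join(x, y, by = "id")),
  left  = nrow(left_join(x, y, by = "id")),
  right = nrow(right_join(x, y, by = "id")),
  full  = nrow(full_join(x, y, by = "id"))
)

# ... and of the matching base::merge() call
rows_base <- c(
  inner = nrow(merge(x, y, by = "id")),
  left  = nrow(merge(x, y, by = "id", all.x = TRUE)),
  right = nrow(merge(x, y, by = "id", all.y = TRUE)),
  full  = nrow(merge(x, y, by = "id", all = TRUE))
)

rows_dplyr
rows_base
```

Both vectors should agree: the inner join keeps only matched keys, the left and right joins keep all rows of one table, and the full join keeps every key from either table.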
dplyr verbs:
- arrange()
- select()
- filter()
- mutate()
- group_by()
- summarize()
On Canvas, open the survey “LockDown browser test” and answer the questions.
You must install the LockDown browser on your computer.
Consider the following code:
temperature_conversions <- data.frame(conversion_factor = c(5/9, 1),
scale = c("Fahrenheit", "Celsius"))
temperature <- data.frame(city = c("StGallen", "Zürich", "Detroit"),
temp = c(12, 14, 21),
scale = c("Cel", "Cel", "F"))
Select the statements that are true:
- inner_join(temperature, temperature_conversions, by="scale") returns a data frame with 3 rows.
- left_join(temperature, temperature_conversions, by="scale") returns a data frame with 3 rows.
- full_join(temperature, temperature_conversions, by="scale") returns a data frame with 3 rows.
Consider the following data frame:
> main_dataset
city temp scale conversion_factor
1 StGallen 12 Celsius 1.0000000
2 Zürich 14 Celsius 1.0000000
3 Detroit 40 Fahrenheit 0.5555556
Select the statements that are true:
- main_dataset |> mutate(temp_celsius = ifelse(scale == "Fahrenheit", (temp-32) * conversion_factor, temp)) replaces the variable temp with temp_celsius
- main_dataset |> mutate(temp_celsius = ifelse(scale == "Fahrenheit", (temp-32) * conversion_factor, temp)) has 3 rows and 5 columns
- main_dataset |> summarize(mean_temp = mean(temp), min_temp = min(temp)) returns a tibble containing 2 columns and 1 row
- main_dataset |> summarize(mean_scale = mean(scale), sd_scale = sd(scale)) is a good way to get summary statistics about the variable scale
Consider the following code:
Ostschwiz <- data.frame(
Region = c("St. Gallen", "Appenzell", "Appenzell"),
FavoriteBeverage = c("Schützengarten", "Quöllfrisch", "Appenzeller"),
Year = c(2021, 2021, 2022)
)
Valais <- data.frame(
Region = c("Valais"),
FavoriteBeverage = c("Fendant"),
Year = c(2021)
)
Vaud <- data.frame(
Region = c("Vaud"),
FavoriteBeverage = c("Chasselas"),
Year = c(2021)
)
The statement nrow(rbind(Ostschwiz, Valais, Vaud)) == 5 returns FALSE
Consider the following data frame:
Ostschwiz <- data.frame(
Region = c("St. Gallen", "Appenzell", "Appenzell"),
FavoriteBeverage = c("Schützengarten", "Quöllfrisch", "Appenzeller"),
Year = c(2021, 2021, 2022)
)
Valais <- data.frame(
Region = c("Valais"),
FavoriteBeverage = c("Fendant"),
Year = c(2021)
)
Vaud <- data.frame(
Region = c("Vaud"),
FavoriteBeverage = c("Chasselas"),
Year = c(2021)
)
Ostschwiz <- pivot_wider(Ostschwiz,
names_from = c(Year),
values_from = FavoriteBeverage)
The statement colnames(Ostschwiz)[2] returns "FavoriteBeverage"
There are two ways to display data: tables or graphs.
Which one to use depends on the purpose.
A chart typically contains at least one axis; the values are represented by visual objects (dots, lines, bars), and the axes typically have scales or labels.
A table typically contains rows and columns, and the values are represented by text.
# load packages and data
library(tidyverse)
data("swiss")
# compute summary statistics
swiss_summary <- swiss |>
summarise(avg_education = mean(Education),
avg_fertility = mean(Fertility),
N = n()
)
swiss_summary
##   avg_education avg_fertility  N
## 1      10.97872      70.14255 47
Problems?
swiss_summary_rounded <- round(swiss_summary, 2)
swiss_summary_rounded
##   avg_education avg_fertility  N
## 1         10.98         70.14 47
The format() function: example
swiss_form <- format(swiss_summary_rounded,
                     decimal.mark = ",")
swiss_form
##   avg_education avg_fertility  N
## 1         10,98         70,14 47
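A small sketch of two `format()` arguments, using hypothetical values chosen for illustration. It also shows that `format()` returns text, which is why formatting is best left as the last step before printing a table.

```r
# Hypothetical values, chosen for illustration
avg <- 70.1

# nsmall forces a minimum number of decimals
padded <- format(avg, nsmall = 2)
padded

# decimal.mark switches the decimal separator
formatted <- format(70.14, decimal.mark = ",")
formatted

# format() returns character strings, not numbers:
# arithmetic on the result no longer works
is.character(formatted)
```

Because the result is text, rounding and any further computation should happen before calling `format()`.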
Useful string functions: toupper()/tolower() change the case, and trimws() removes leading and trailing whitespace.
string <- "AbCD "
toupper(string)
## [1] "ABCD "
tolower(string)
## [1] "abcd "
trimws(tolower(string))
## [1] "abcd"
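These helpers are typically combined to harmonize labels before grouping or merging. A minimal sketch, assuming a hypothetical vector of inconsistently typed scale labels:

```r
# Hypothetical raw labels with inconsistent case and stray spaces
raw_scale <- c(" Celsius", "CELSIUS ", "celsius", "Fahrenheit  ")

# Lower-case first, then strip leading/trailing whitespace
clean_scale <- trimws(tolower(raw_scale))
clean_scale

# Only two distinct labels remain after cleaning
unique(clean_scale)

# trimws() can also trim one side only
left_trimmed <- trimws("  abc  ", which = "left")
left_trimmed
```

Without this step, `" Celsius"` and `"celsius"` would be treated as different categories by `group_by()` or as non-matching keys by a join.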
gtExtras and sparklines
head(USArrests, 10)
##             Murder Assault UrbanPop Rape
## Alabama       13.2     236       58 21.2
## Alaska        10.0     263       48 44.5
## Arizona        8.1     294       80 31.0
## Arkansas       8.8     190       50 19.5
## California     9.0     276       91 40.6
## Colorado       7.9     204       78 38.7
## Connecticut    3.3     110       77 11.1
## Delaware       5.9     238       72 15.8
## Florida       15.4     335       80 31.9
## Georgia       17.4     211       60 25.8
Problems?
library(gtExtras)
USArrests_summary <- USArrests |>
mutate(UrbanPop = case_when(UrbanPop > quantile(UrbanPop, .66) ~ "High",
UrbanPop > quantile(UrbanPop, .33) ~ "Middle",
UrbanPop > 0 ~ "Low")) |>
group_by(UrbanPop) |>
summarize(
"Mean murder" = mean(Murder),
"SD murder" = sd(Murder),
Density = list(Murder),
.groups = "drop"
)
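The `Density = list(Murder)` line stores, for each group, the whole vector of murder rates in a single cell (a list column), so that one small distribution plot can be drawn per table row. A minimal sketch of the same idea, assuming a simpler two-group split at the median (a simplification for illustration, not the three-way split used in the slides):

```r
library(dplyr)

by_group <- USArrests |>
  # Hypothetical two-way split at the median of UrbanPop
  mutate(Urban = ifelse(UrbanPop > median(UrbanPop), "High", "Low")) |>
  group_by(Urban) |>
  summarize(
    n = n(),
    Murder_values = list(Murder),  # one numeric vector per group
    .groups = "drop"
  )

# Each cell of the list column holds as many values as the group has rows
lengths(by_group$Murder_values)
```

The list column keeps the raw values alongside the summary statistics, which is what a per-row sparkline needs as input.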
USArrests_summary |>
  gt() |>
  tab_header(
    title = md("Murder rates"),
    subtitle = md("Per high, middle, and low urban population")
  ) |>
  # draw one density plot per row from the list column
  gtExtras::gt_plt_dist(Density, type = "density", line_color = "black",
                        fill_color = "red") |>
  fmt_number(columns = `Mean murder`:`SD murder`, decimals = 2)
Murder rates
Per high, middle, and low urban population
| UrbanPop | Mean murder | SD murder | Density |
|---|---|---|---|
| High | 8.07 | 3.80 | (density plot) |
| Low | 7.41 | 5.20 | (density plot) |
| Middle | 7.89 | 4.18 | (density plot) |
- kable() for HTML / Markdown reports
- stargazer for your LaTeX reports or for your Office Word reports
kable()
knitr::kable(head(USArrests, 5), format = "markdown")
| | Murder | Assault | UrbanPop | Rape |
|---|---|---|---|---|
| Alabama | 13.2 | 236 | 58 | 21.2 |
| Alaska | 10.0 | 263 | 48 | 44.5 |
| Arizona | 8.1 | 294 | 80 | 31.0 |
| Arkansas | 8.8 | 190 | 50 | 19.5 |
| California | 9.0 | 276 | 91 | 40.6 |
knitr::kable(head(USArrests, 5), format = "html")
| | Murder | Assault | UrbanPop | Rape |
|---|---|---|---|---|
| Alabama | 13.2 | 236 | 58 | 21.2 |
| Alaska | 10.0 | 263 | 48 | 44.5 |
| Arizona | 8.1 | 294 | 80 | 31.0 |
| Arkansas | 8.8 | 190 | 50 | 19.5 |
| California | 9.0 | 276 | 91 | 40.6 |
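Beyond `format`, `kable()` takes a few arguments that cover most everyday table polishing. The sketch below uses `digits`, `col.names`, and `caption`; the column labels and the caption text are hypothetical choices for illustration.

```r
library(knitr)

tab <- kable(
  head(USArrests, 3),
  format = "markdown",
  digits = 0,                                    # round numeric columns
  col.names = c("Murder", "Assault", "Urban pop. (%)", "Rape"),
  caption = "First rows of USArrests"            # hypothetical caption text
)
tab
```

Row names are kept as an unnamed first column, so `col.names` only needs one label per data column.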
Three main approaches:
- The graphics package (R Core Team 2018; shipped with the base R installation).
- The lattice package (Sarkar 2008), an implementation of the original Bell Labs ‘Trellis’ system.
- The ggplot2 package (Wickham 2016), an implementation of Leland Wilkinson’s ‘Grammar of Graphics’.
ggplot2 is so good that it has become THE reference. [In Python, use plotnine to apply the grammar of graphics.]
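To make the difference between the three systems concrete, the following sketch draws the same scatterplot with each of them. It assumes the `lattice` and `ggplot2` packages are installed and uses the built-in `swiss` data.

```r
library(lattice)
library(ggplot2)

data(swiss)

# Base graphics: draws immediately on the active graphics device
plot(swiss$Education, swiss$Examination,
     xlab = "Education", ylab = "Examination")

# lattice: returns a "trellis" object that is drawn when printed
p_lattice <- xyplot(Examination ~ Education, data = swiss)

# ggplot2: returns a plot object built from data, aesthetics, and layers
p_ggplot <- ggplot(swiss, aes(x = Education, y = Examination)) +
  geom_point()

print(p_lattice)
print(p_ggplot)
```

Base graphics work like drawing on a canvas, whereas lattice and ggplot2 return objects that can be stored, modified, and printed later.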
ggplot2 basics
Using ggplot2 to generate a basic plot in R is quite simple. It involves three key points:
- The data must be stored in a data.frame/tibble (in tidy format!).
- The starting point of a plot is always the function ggplot().
- The first line of plot code declares the data and the aesthetics (which variables are mapped to the x- and y-axes):
ggplot(data = my_dataframe, aes(x= xvar, y= yvar))
Example: the swiss dataset
library(tidyverse) # automatically loads ggplot2
# load the data
data(swiss)
head(swiss)
##              Fertility Agriculture Examination Education Catholic Infant.Mortality
## Courtelary        80.2        17.0          15        12     9.96             22.2
## Delemont          83.1        45.1           6         9    84.84             22.2
## Franches-Mnt      92.5        39.7           5         5    93.40             20.2
## Moutier           85.8        36.5          12         7    33.77             20.3
## Neuveville        76.9        43.5          17        15     5.16             20.6
## Porrentruy        76.1        35.3           9         7    90.57             26.6
Code a province as ‘Catholic’ if more than 50% of the inhabitants are Catholic, and as ‘Protestant’ otherwise:
# via tidyverse/mutate
swiss <- mutate(swiss,
Religion =
ifelse(50 < Catholic, 'Catholic', 'Protestant'))
# 'old school' alternative
swiss$Religion <- 'Protestant'
swiss$Religion[50 < swiss$Catholic] <- 'Catholic'
# set to factor
swiss$Religion <- as.factor(swiss$Religion)
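The two recoding approaches are interchangeable. A quick check, sketched here on a fresh copy of `swiss` with hypothetical helper names (`religion_mutate`, `religion_base`):

```r
library(dplyr)
data(swiss)

# via mutate()/ifelse()
religion_mutate <- mutate(
  swiss,
  Religion = ifelse(50 < Catholic, "Catholic", "Protestant")
)$Religion

# via base R indexing
religion_base <- rep("Protestant", nrow(swiss))
religion_base[50 < swiss$Catholic] <- "Catholic"

# Both approaches produce the same character vector
identical(religion_mutate, religion_base)

# As a factor, the levels are sorted alphabetically
levels(as.factor(religion_base))
table(religion_base)
```

Converting to a factor matters later on: ggplot2 uses the factor levels to order facets and legend entries.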
ggplot(data = swiss, aes(x = Education, y = Examination))
ggplot(data = swiss, aes(x = Education, y = Examination)) +
geom_point()
ggplot(data = swiss, aes(x = Education, y = Examination)) +
geom_point() +
facet_wrap(~Religion)
ggplot(data = swiss, aes(x = Education, y = Examination)) +
geom_point() +
geom_smooth(method = 'loess') +
facet_wrap(~Religion)
ggplot(data = swiss, aes(x = Education, y = Examination)) +
geom_point() +
geom_smooth(method = 'lm') +
facet_wrap(~Religion)
ggplot(data = swiss, aes(x = Education, y = Examination)) +
geom_point(aes(color = Agriculture)) +
geom_smooth(method = 'lm') +
facet_wrap(~Religion)
ggplot(data = swiss, aes(x = Education, y = Examination)) +
geom_point(aes(color = Agriculture)) +
geom_smooth(method = 'lm') +
facet_wrap(~Religion) +
coord_flip()
ggplot(data = swiss, aes(x = Education, y = Examination)) +
geom_point(aes(color = Agriculture)) +
geom_smooth(method = 'lm') +
facet_wrap(~Religion) +
theme(legend.position = "bottom", axis.text=element_text(size=12) )
ggplot(data = swiss, aes(x = Education, y = Examination)) +
geom_point(aes(color = Agriculture)) +
geom_smooth(method = 'lm') +
facet_wrap(~Religion) +
theme_minimal()
library(ggthemes)
ggplot(data = swiss, aes(x = Education, y = Examination)) +
geom_point() +
geom_smooth(method = 'lm') +
facet_wrap(~Religion) +
theme_economist()
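Since every `+` returns a new plot object, the repeated code in the sequence above can be avoided by storing intermediate plots and extending them. A minimal sketch; the object names and the plot title are hypothetical:

```r
library(ggplot2)
data(swiss)

# Recreate the faceting variable (same rule as in the recoding step)
swiss$Religion <- ifelse(50 < swiss$Catholic, "Catholic", "Protestant")

# Store the basic scatterplot once ...
p_base <- ggplot(data = swiss, aes(x = Education, y = Examination)) +
  geom_point()

# ... then extend it step by step
p_facets <- p_base + facet_wrap(~Religion)
p_final <- p_facets +
  theme_minimal() +
  labs(title = "Education and examination results")  # hypothetical title

p_final
```

Nothing is drawn until a plot object is printed, so `p_base` and `p_facets` remain available for other variants.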
Link: https://rstudio.github.io/cheatsheets/html/data-visualization.html
Values are represented by their position relative to the axes: line charts and scatterplots.
Values are represented by the size of an area: bar charts and area charts.
Values are continuous: use a chart type that visually connects elements (line chart).
Values are categorical: use a chart type that visually separates elements (bar chart).
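The two rules of thumb can be illustrated with ggplot2. The sketch below assumes the `economics` data that ships with ggplot2 for the continuous case and the recoded `swiss` data for the categorical case; the object names are hypothetical.

```r
library(ggplot2)
data(swiss)
swiss$Religion <- ifelse(50 < swiss$Catholic, "Catholic", "Protestant")

# Categorical x: bars keep the groups visually separate
p_bar <- ggplot(swiss, aes(x = Religion)) +
  geom_bar()

# Continuous x (time): a line connects neighbouring values
p_line <- ggplot(economics, aes(x = date, y = unemploy)) +
  geom_line()

p_bar
p_line
```

Connecting the two religion categories with a line would suggest intermediate values that do not exist, which is why the bar chart is the better fit there.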
(Source: https://hhsievertsen.github.io/EconDataBook/data-visualization-basics.html)
Recommendations from Tufte
We can quantify the Lie Factor of a graph
Lie Factor = \(\frac{\text{size of effect shown in graphic}}{\text{size of effect in data}}\)
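A worked example with hypothetical numbers: a value rises from 50 to 55, but the bar chart's y-axis starts at 45, so the bars are drawn 5 and 10 units tall. The sketch assumes that the "size of effect" is measured as a relative change, and `lie_factor()` is a hypothetical helper name.

```r
# Relative change between two values
relative_change <- function(from, to) (to - from) / from

# Lie Factor = effect shown in the graphic / effect in the data
lie_factor <- function(shown_from, shown_to, data_from, data_to) {
  relative_change(shown_from, shown_to) / relative_change(data_from, data_to)
}

# Hypothetical truncated-axis bar chart:
# data 50 -> 55 (+10%), bar heights 5 -> 10 (+100%)
lf <- lie_factor(shown_from = 5, shown_to = 10, data_from = 50, data_to = 55)
lf
```

A Lie Factor of 1 means the graphic is faithful to the data; here the graphic exaggerates the effect tenfold.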
Only what matters should be reported (Data-ink Ratio):
Data-ink Ratio = \(\frac{\text{ink used for data}}{\text{total ink used to print the graphic}}\)
Data visualization is an art of story-telling, deception, and scientific exactitude 🤓.
R Core Team. 2018. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Sarkar, Deepayan. 2008. Lattice: Multivariate Data Visualization with R. New York: Springer. http://lmdvr.r-forge.r-project.org.
Wickham, Hadley. 2016. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. http://ggplot2.org.
Wickham, Hadley, and Garrett Grolemund. 2017. R for Data Science. Sebastopol, CA: O’Reilly. http://r4ds.had.co.nz/.